-
Notifications
You must be signed in to change notification settings - Fork 335
feat(writer): Add clustered and fanout writer #1735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
crates/iceberg/src/writer/mod.rs
Outdated
/// Build the iceberg writer. | ||
async fn build(self) -> Result<Self::R>; | ||
/// Build the iceberg writer for an optional partition key. | ||
async fn build_with_partition(self, partition_key: Option<PartitionKey>) -> Result<Self::R>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change. I believe this is necessary because:
-
IcebergWriter
is supposed to generateDataFile
that always hold a partition value according to iceberg spec. -
The existing code store partition value in the builder directly, making
builder.clone()
useless:
let builder = IcebergWriterBuilder::new(partition_A);
let writer_A = builder.build();
... // write to partition A
// done with partition A and now we need to write to partition B
// this is wrong because partition value A is still stored in the builder
let writer_B = builder.clone().build()
An alternative is to add a new method clone_with_partition()
but that would also be a breaking change and it's less clean compared to build_with_partition()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this change, but I want a further change as following:
async fn build(&self, partition_key: Option<PartitionKey>) -> Result<Self::R>
If the builder could be reused for creating actual IcebergWriter
, I want to avoid cloning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, this is not changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I thought I replied earlier but it seems my comment didn't go thru.
I think your advice makes sense but it would be better to fix this in a separate PR.
Changing IcebergWriterBuilder::builder
to &self
means we will need to change all the writer builder along the writers chain to use &self
(IcebergWriterBuilder -> RollingFileWriterBuilder -> FileWriterBuilder). Otherwise, the builder would look like the following and won't help too much since we still need to clone the inner. This also can cause further confusion to users since we have different semantics for each writer builder
async fn build(&self, partition_key: Option<PartitionKey>) -> Result<Self::R> {
Ok(DataFileWriter {
inner: Some(self.inner.clone().build()), // inner builder still needs to be cloned
partition_key,
})
}
Please let me know your thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaination, that makes sense to me. It would be better to create an issue to track it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a tracking issue here: #1753
crates/iceberg/src/writer/partitioning/clustered_data_writer.rs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @CTTY for this pr! Just finished first round of review, and I think we are on the right track!
crates/iceberg/src/writer/mod.rs
Outdated
/// Build the iceberg writer. | ||
async fn build(self) -> Result<Self::R>; | ||
/// Build the iceberg writer for an optional partition key. | ||
async fn build_with_partition(self, partition_key: Option<PartitionKey>) -> Result<Self::R>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this change, but I want a further change as following:
async fn build(&self, partition_key: Option<PartitionKey>) -> Result<Self::R>
If the builder could be reused for creating actual IcebergWriter
, I want to avoid cloning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @CTTY for this pr!
Which issue does this PR close?
What changes are included in this PR?
New:
partitioning
module withPartitioningWriter
traitClusteredWriter
: Optimized for pre-sorted data, requires writing in partition orderFanoutWriter
: Flexible writer that can handle data from any partition at any timeModification:
DataFileWriterBuilder
to support dynamic partition assignmentAre these changes tested?
Added unit tests